This page is organized as follows.
In Part 1, we create the multilingualism plots based on Multidagestan data (Dobrushina et al. 2017). In Part 2, we use the loanword data and calculate the numbers of sharings between the East Caucasian wordlists and donor languages. We use regression analysis and a series of t-tests to test our claims for statistical significance. Part 3 runs permutation tests. To conduct the permutation tests, we create a new sample of wordlists based on our data but with pan-Daghestanian probabilities of attesting particular lexemes in the lists, and test our real data against this sample. The permutation block consists of two parts: Avar and Azerbaijani influence vs. Chechen and Georgian influence. The two parts are more or less the same apart from the data sampling, which is why the second part is commented in less detail.
The plots in this supplement appear in the order they are created, which differs from the order in the paper. This decision was made to increase the readability of the code. For convenience, the figures here are provided with their numbers in the paper.
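The resampling idea behind Part 3 can be illustrated on toy data. This is only a minimal sketch, not the code of the supplement: the data frame `words`, its columns, the helper `simulate_list`, and the choice of test statistic (the number of loanwords per list) are all assumptions made for the example.

```r
set.seed(1)

# Toy data (hypothetical): one row per concept per wordlist;
# `loan` is 1 if the concept is translated by a loanword from the donor.
words <- data.frame(
  list    = rep(paste0("list", 1:6), each = 5),
  area    = rep(c("North", "South"), each = 15),
  concept = rep(paste0("c", 1:5), times = 6),
  loan    = c(0, 0, 1, 0, 0,  0, 0, 0, 0, 0,  0, 1, 0, 0, 0,
              1, 1, 1, 0, 1,  1, 0, 1, 1, 1,  1, 1, 0, 1, 1)
)

# Probability of a loanword for each concept, pooled over all areas.
p_concept <- tapply(words$loan, words$concept, mean)

# Draw one artificial wordlist: every concept is a loanword with its
# pooled probability, regardless of area. Returns the loanword count.
simulate_list <- function(p) sum(rbinom(length(p), size = 1, prob = p))

n_sim     <- 2000
simulated <- replicate(n_sim, simulate_list(p_concept))

# Observed loanword counts in the real wordlists.
observed <- tapply(words$loan, words$list, sum)

# Share of artificial lists with at least as many loanwords as each real list.
p_upper <- sapply(observed, function(x) mean(simulated >= x))
print(round(p_upper, 3))
```

A small share for a given list indicates that it contains more loanwords from the donor than would be expected if borrowing did not depend on the area.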
In Fig. 3, the plots show bilingualism rates for speakers of the local languages with each of Turkic, Avar, Georgian and Chechen. The data are again split by the three areas. Dots represent villages, with a horizontal line showing the median value per district. The Y axis shows the percentage of people born before 1919 bilingual in the respective language.
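The two quantities plotted here (a percentage per village and a median per area) could be derived along the following lines. The data frame `speakers` and its column names are hypothetical; the sketch assumes one row per speaker born before 1919 and a 0/1 column per L2.

```r
# Toy data (hypothetical): one row per speaker born before 1919.
speakers <- data.frame(
  village = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  area    = c(rep("Northeast", 6), rep("South", 4)),
  avar    = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 1)
)

# Percentage of Avar bilinguals per village (the dots in the plot).
village_rate <- aggregate(avar ~ village + area, data = speakers,
                          FUN = function(x) 100 * mean(x))

# Median of the village percentages per area (the horizontal line).
area_median <- aggregate(avar ~ area, data = village_rate, FUN = median)

print(village_rate)
print(area_median)
```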
The dataset below contains the percentages of speakers of each L2 in the three regions.
The scatter plot in Fig. 4b shows bilingualism rates in Avar and Turkic (Azerbaijani in the South and Kumyk in the North). The dots represent villages. In both cases, Avar is shown on the X axis, and Turkic is shown on the Y axis. Both charts show a very clear divide between the South, the domain of Azerbaijani, and the North, the Avar fiefdom.
In Fig. 2a and b, the four plots show lexical influence from Turkic, Avar, Georgian and Chechen. The data are split by the three areas on the X axis. The Y axis shows the percentage of the loanwords found in the elicitations. Dots represent elicitations, with a horizontal line showing the median value per district. Fig. 2a and 2b differ in that the former shows the percentage of the total number of loanwords from the respective donor, including loanwords this donor shares with other donors, for which we cannot be a priori sure which language they come from. Fig. 2b only shows the percentage of ‘unshared’ loanwords, which are attested in this donor but not in the other donors. The disadvantage of this way of counting is that it may underestimate the amount of lexical borrowing from the relevant donor. In Fig. 2b, all figures drop, but they do so most conspicuously in the case of Turkic influence in the North, a fact we come back to at the end of this section. For now, we may interpret the counts in Fig. 2a and the counts in Fig. 2b as upper (liberal) and lower (conservative) estimates, and trust that the true values lie somewhere between them.
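The difference between the liberal and the conservative count can be made concrete with a small sketch. The list `elicitation` and the helper `donor_share` are hypothetical and only illustrate the counting logic described above, under the assumption that each elicited word is annotated with the set of donors in which it is attested.

```r
# Toy elicitation (hypothetical): for each concept, the donors in which the
# elicited word is attested; character(0) marks a non-loan.
elicitation <- list(
  c1 = c("Turkic", "Avar"),              # shared by two potential donors
  c2 = "Turkic",                         # unshared
  c3 = "Avar",                           # unshared
  c4 = character(0),                     # not a loanword
  c5 = c("Turkic", "Avar", "Georgian")   # shared by three potential donors
)

donor_share <- function(words, donor) {
  in_donor <- vapply(words, function(d) donor %in% d, logical(1))
  only_one <- vapply(words, function(d) length(d) == 1, logical(1))
  c(upper = 100 * mean(in_donor),             # all loans, as in Fig. 2a
    lower = 100 * mean(in_donor & only_one))  # unshared loans, as in Fig. 2b
}

turkic <- donor_share(elicitation, "Turkic")
print(turkic)
```

In this toy elicitation the upper estimate for Turkic is 60 percent and the lower estimate is 20 percent, since two of the three Turkic words are also attested in Avar.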
The dataset below contains the percentages of loans from each L2 in the three regions.
Each plot in Fig. 8 shows two distributions of elicitations according to the percentage of loanwords. The higher one is based on the total number of loanwords and is copied from Fig. 2a. The lower one is based on the number of loanwords that are not shared with any other donor and is copied from Fig. 2b. In each case, the true amount of loanwords lies somewhere in between, but we have highlighted in a darker shade the distribution to which, we believe, the true values are closer. In one case we do not think we have enough evidence to decide.
The scatter plot in Fig. 2a shows the amount of loanwords from Avar and Turkic per elicitation (loanwords shared by several potential donors are excluded from the counts, as in Fig. 2b above). The dots represent individual elicitations. Avar is shown on the X axis, and Turkic is shown on the Y axis. Like Fig. 2b, the chart in Fig. 2a shows a very clear divide between the South, the domain of Azerbaijani, and the North, the Avar fiefdom.
For each village, the chart shows the total number of words of Turkic origin (bar 1), those of these words that are shared with Avar, the local lingua franca (bar 2), those that are shared with Avar only (bar 3), those shared with the locally important neighbour, Chechen in the Northeast and Georgian in the Northwest (Donor 2, bar 4), and those shared with it only (bar 5), as well as the number of Turkic words that are not shared with either of them. Typically, the second bar gets close to the first bar. Essentially, this means that most Turkic loanwords attested in the village are also present in Avar. Turkic loanwords not shared with either of the bigger languages are few across the board. In Kidero, they are not attested at all. There remains a possibility that some of the Turkic loanwords entered the minority languages neither directly nor via the lingua franca but through the mediation of another locally important language or dialect. This could be Chechen in the Northeast or Georgian in the Northwest. Given that the observed amount of loanwords from Chechen and Georgian in the North in Fig. 2 is almost non-existent, this probability seems to be low. Fig. 5 confirms the expectation: in all cases except Zilo, Turkic loanwords shared with Chechen (Georgian), shown by bar 4, are visibly fewer than those shared with Avar (bar 2).
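The bar heights described above amount to a handful of set operations. The sketch below uses invented concept labels and hypothetical vector names; it only illustrates how the six counts relate to each other, not how the supplement computes them.

```r
# Toy data (hypothetical concept labels): concepts for which a village has a
# Turkic loanword, and concepts for which the same word is also attested in
# Avar or in the second potential mediator (Chechen or Georgian).
village_turkic <- c("c1", "c2", "c3", "c4", "c5", "c6")
in_avar        <- c("c1", "c2", "c3", "c4")
in_donor2      <- c("c3", "c4", "c7")

shared_avar   <- intersect(village_turkic, in_avar)
shared_donor2 <- intersect(village_turkic, in_donor2)

bars <- c(
  total       = length(village_turkic),                         # bar 1
  avar        = length(shared_avar),                            # bar 2
  avar_only   = length(setdiff(shared_avar, shared_donor2)),    # bar 3
  donor2      = length(shared_donor2),                          # bar 4
  donor2_only = length(setdiff(shared_donor2, shared_avar)),    # bar 5
  unshared    = length(setdiff(village_turkic,
                               union(in_avar, in_donor2)))      # not shared
)
print(bars)
```

In this toy village, every word shared with the second donor is also shared with Avar, so the count of words shared with the second donor only is zero, which is the pattern the text describes for most northern villages.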
In Fig. 6, bar 2 corresponds to standard Lezgian and bar 4 corresponds to the local dialect of Lezgian, the two languages that are the closest matches for Avar and Chechen/Georgian in the North. (We cannot use Azerbaijani, the perfect match for Avar, because it is the possibility of mediation of Turkic loanwords that we are considering, and Azerbaijani is a Turkic language and the source of these loanwords in the first place.) In addition to the obvious decrease in the number of Turkic loanwords, we see another important difference. Bar 2, the Turkic loanwords that the village shares with standard Lezgian, is much lower than bar 1, the total number of the loanwords, and there are almost no loanwords shared with standard Lezgian only. Essentially, in each village, Turkic loanwords include those that are shared with the local variety of Lezgian (bar 4) and those that are unshared (bar 5). One could conclude, as we did for Avar as a mediator in the North, that in the South many loanwords have been mediated by the local Lezgian. However, unlike in the North, bar 5 (the number of the Turkic loanwords attested in the village but not in the major language, Lezgian, or its local variety, Akhty Lezgian) tends to be relatively high, especially in the Tsakhur villages of Mikik, Gelmets and Kurdul. There are quite a few Turkic loanwords in minority languages that are not attested in our Lezgian data, either standard or dialect. Note that these three villages also show the highest amount of Turkic loanwords. Bilingualism rates indicate that the knowledge of Azerbaijani among the Tsakhurs was much higher (between 90 and 100 percent) than in the Lezgian villages in our sample. We thus suggest that it was in fact Tsakhur that could have mediated Turkic lexicon to the local (but not the standard) variety of Lezgian rather than the other way around.
In Fig. 7, the X axis shows the percentage of Turkic loanwords shared with the locally relevant major language. In the North this is Avar. As in Fig. 5 and 6 above, we take standard Lezgian in the South to compare with Avar. The Y axis shows the second potential mediator, Chechen in the Northeast and Georgian in the Northwest. For the sake of comparison, we take the Akhty dialect of Lezgian in the South. The chart shows a pronounced difference between the North and the South. As argued above, we interpret it in the following way: most of the Turkic influence on the small languages in the North was probably mediated by Avar, while a substantial amount of Turkic lexical influence on the Akhty dialect of Lezgian was probably mediated by the local small languages in closer contact with Turkic, e.g. Tsakhur.
Fig. 9 and 10 show effect plots for Bayesian logistic regression using the area to predict the probability of a concept being translated by a loanword from a specific language and of a person being bilingual in a specific language, respectively. We included regression analysis for bilingualism rates because the data on people born before 1919 are scarce enough to require a test for statistical significance, just like our data on lexical borrowing. Bayesian logistic regression was used in order to bypass the problems caused by perfect separation in the multilingualism data (e.g. in the Northeast there is not a single L2 speaker of Georgian, whereas in the Northwest there are no L2 speakers of Chechen, which results in a probability of zero at these predictor levels and makes the more frequently used glm and glmer inapplicable). The regression based on multilingualism data was fitted using the bayesglm function from the arm R package (Gelman and Su 2018). The regression for loanword data used the bglmer function from the blme R package (Chung et al. 2013).
The regression modelling was conducted as follows. Each entry in the loanword data was annotated for the source language using one dummy variable column per potential donor (Turkic, Avar, Georgian and Chechen). If a word is a loan from a particular language, it was marked with a “1” in the relevant column and “0” in the others; non-loans were marked with “0” in all columns. Multilingualism data were structured similarly: each line of the dataset corresponded to one speaker, and four dummy columns encoded their L2 knowledge. Four separate models were fitted, one for each donor language or L2. In the regression based on multilingualism data (Fig. 10), the only predictor was the district. In the loanword regression we introduced a random intercept for every village nested into language to control for possible differences between languages and villages.
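The dummy coding and the separation problem can be illustrated on toy data. The data frame `speakers` and its `l2` column are hypothetical. The `arm::bayesglm` call is guarded so that the rest of the sketch runs with base R alone; it is an assumed minimal form of the model and not necessarily the exact call used in the supplement.

```r
# Toy data (hypothetical): one row per speaker; `l2` lists the L2s they know.
speakers <- data.frame(
  area = rep(c("Northeast", "Northwest", "South"), each = 4),
  l2   = c("Avar", "Avar;Chechen", "Avar", "",
           "Avar;Georgian", "Georgian", "Avar", "Avar",
           "Turkic", "Turkic", "", "Turkic")
)

# One 0/1 dummy column per L2, as described in the text.
for (lang in c("Turkic", "Avar", "Georgian", "Chechen")) {
  speakers[[tolower(lang)]] <-
    as.integer(grepl(lang, speakers$l2, fixed = TRUE))
}

# Perfect separation: two of the areas have no L2 speakers of Georgian at all,
# so an ordinary maximum-likelihood logistic fit has no finite estimate there.
georgian_by_area <- tapply(speakers$georgian, speakers$area, sum)
print(georgian_by_area)

# A Bayesian fit with weakly informative priors keeps the estimates finite.
# Runs only if the arm package is installed.
if (requireNamespace("arm", quietly = TRUE)) {
  fit <- arm::bayesglm(georgian ~ area,
                       family = binomial(link = "logit"),
                       data = speakers)
  print(coef(fit))
}
```

For the loanword data, the analogous model would presumably add the nested random intercept, for example `blme::bglmer(turkic ~ area + (1 | language/village), family = binomial, data = ...)`; the exact formula and column names are assumptions.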
The data and the code for the models are provided in the supplement.
The Y axis shows the percentage of loanwords in the elicitation (Fig. 9) and bilinguals in the village (Fig. 10). On the X axis, the three areas are shown. In the figures, the confidence intervals for the effect of the area are shown as vertical lines. We can argue for a statistically significant areal signal when the confidence intervals for the areas do not overlap. Turkic shows significant presence in the South, while Avar is the major stakeholder in the North, with the difference between the Northwest and the Northeast not being statistically significant. Georgian influence is higher in the Northwest than in other areas. As discussed in Section 4, lexical borrowing from Chechen proves not to be significantly higher in the Northeast, where it could be expected geographically, than in other areas. Note that, in terms of bilingualism in Chechen, the Northeast is not different from the Northwest at a statistically significant level; but this is most likely due to the fact that we considered the Northeast as a whole, while knowledge of Chechen is only attested in two villages out of ten, and even there it was not very high.
More generally, the figures suggest that the areal differences in terms of bilingualism are stronger than in terms of lexical borrowing. Note that Fig. 9, which is intended to show that areal differences are statistically significant, is based on the conservative, lower counts (i.e. excluding all sharings). If the interpretation of mediated borrowing discussed in Section 4.2 and shown in Fig. 8 is correct, then the difference between lexical borrowing from Avar and Azerbaijani, the two lingua francas, on the one hand, and from the locally important languages, Georgian and Chechen, on the other, may be stronger.
In our data, there is a possible confounding factor that must be excluded. It could be that the areal differences we observed in the first two models are due to the impact of very few highly borrowable concepts. If these concepts are very few, the fact that those in the South are borrowed from Azerbaijani and those in the North are borrowed from Avar can be due to chance. To exclude this, we have compared the distributions of the probability of borrowing different concepts from a specific donor by using a t-test. The tests were conducted as follows. First, four dummy variables (Turkic, Avar, Georgian and Chechen) were introduced to annotate each word for its donor language. Each entry in the dataset was annotated with a “1” value if the word is borrowed from a given source and “0” if it is not. Then, for each concept, the mean probabilities were calculated within each region. The sets of means (160 values per region) were then compared. The results of the pairwise t-tests are provided in Table 2. We use the Bonferroni correction for multiple comparisons and take .01 as the significance level. Differences that are statistically significant after the Bonferroni correction are indicated by an asterisk.
## # A tibble: 3 x 15
## estimate .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 -2.36e-4 Turk… North… North… 160 160 -0.0715 9.43e- 1 159 -0.00677
## 2 -1.30e-1 Turk… North… South 160 160 -6.52 8.67e-10 159 -0.170
## 3 -1.30e-1 Turk… North… South 160 160 -6.68 3.71e-10 159 -0.169
## # … with 5 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
## # p.adj <dbl>, p.adj.signif <chr>
## # A tibble: 3 x 15
## estimate .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.0284 Avar North… North… 160 160 1.21 2.28e- 1 159 -0.0180
## 2 0.147 Avar North… South 160 160 6.50 9.98e-10 159 0.102
## 3 0.118 Avar North… South 160 160 4.83 3.23e- 6 159 0.0700
## # … with 5 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
## # p.adj <dbl>, p.adj.signif <chr>
## # A tibble: 3 x 15
## estimate .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 -0.0543 Geor… North… North… 160 160 -3.46 6.92e-4 159 -0.0853
## 2 -0.00716 Geor… North… South 160 160 -1.82 7.00e-2 159 -0.0149
## 3 0.0471 Geor… North… South 160 160 2.89 4.00e-3 159 0.0149
## # … with 5 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
## # p.adj <dbl>, p.adj.signif <chr>
## # A tibble: 3 x 15
## estimate .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 -0.00112 Chec… North… North… 160 160 -0.275 0.784 159 -0.00915
## 2 -0.00487 Chec… North… South 160 160 -0.887 0.376 159 -0.0157
## 3 -0.00375 Chec… North… South 160 160 -0.579 0.563 159 -0.0165
## # … with 5 more variables: conf.high <dbl>, method <chr>, alternative <chr>,
## # p.adj <dbl>, p.adj.signif <chr>
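The procedure behind the tables above can be sketched in base R on simulated data. The data frame `d`, its column names, and the borrowing probabilities are invented for the example. The sketch uses paired tests because the output above reports 159 degrees of freedom for 160 values per group, which is what a paired comparison over concepts would give; this is an inference from the output, not something the text states.

```r
set.seed(42)

# Toy data (hypothetical): 20 concepts x 8 wordlists x 3 regions, with a 0/1
# dummy marking loans from one donor (here called `turkic`).
d <- expand.grid(concept = paste0("c", 1:20),
                 list    = 1:8,
                 region  = c("Northeast", "Northwest", "South"),
                 stringsAsFactors = FALSE)
p <- ifelse(d$region == "South", 0.4, 0.1)   # invented probabilities
d$turkic <- rbinom(nrow(d), size = 1, prob = p)

# Mean borrowing probability per concept within each region.
m <- aggregate(turkic ~ concept + region, data = d, FUN = mean)

# Sort so that the k-th value in every region refers to the same concept;
# the paired tests rely on this alignment.
m <- m[order(m$region, m$concept), ]

# Pairwise paired t-tests with Bonferroni-adjusted p-values.
res <- pairwise.t.test(m$turkic, m$region, paired = TRUE,
                       p.adjust.method = "bonferroni")
print(res$p.value)
```

The adjusted p-values in `res$p.value` can then be compared against the chosen significance level (.01 in the text).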
In Fig. 11, Turkic is shown in the two graphs at top left and Avar in the two graphs at top right, each comparing North (the left distribution) and South (the right distribution). Similarly, Georgian is shown in the two graphs at bottom left and Chechen in the two graphs at bottom right, each comparing Northeast (the left distribution) and Northwest (the right distribution). As in the regression model, we see that the higher presence of Turkic in the South than in the North and the higher presence of Avar in the North than in the South are statistically significant. The presence of Georgian in the Northwest as compared to the Northeast is also significant even though much smaller (compare the scale of the X axis), while the presence of Chechen in the Northeast is not.
The following R packages were used to create this page: